Efficient Extractive Summarization in a Metric Space

Introduction

In this article we explore how to efficiently select a representative subset of items from a larger set. We are given a "universe" (set) of points in a metric space and wish to summarize these points by selecting relatively few representative points. The goal is to choose the summary so that as many points in the universe as possible are as close as possible to a point in the summary. In this article we define the precise problem, mention an approximation algorithm in the literature, and present a much more efficient algorithm to achieve the same results as the approximation algorithm.

Precise Formulation

Let the universe \(U\) be a set of items in a metric space with metric \(d\). Let the "cost" of an item in \(U\) being a certain distance from an item in the summary be given by a positive, strictly increasing function \(c : \mathbb{R}^+ \to \mathbb{R}^+\). Let \(r > 0\) be a radius such that if an item in \(U\) is not within distance \(r\) of any point in the summary, it does not matter how far away it actually is (for example, we might consider such an item not represented by the summary at all). A summary \(S\) is a subset of \(U\).

Now we are ready for the main definitions. Given a summary \(S\) and \(u \in U\), the minimum distance to the summary is \[\text{dMin}_S(u) = \min(\{d(u, s) \mid s \in S\} \cup \{r\}).\] The total cost of the summary is \[C(S) := \sum_{u \in U \setminus S} c(\text{dMin}_S(u)).\] In other words, we sum, over all items not in the summary, the cost \(c\) associated with each item's minimum distance to \(S\), where that distance is capped at the radius \(r\). Note that \(C\) is strictly decreasing: adding any item to \(S\) strictly decreases \(C\).
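
To make the definitions concrete, the following is a minimal sketch of computing \(C(S)\) directly from its definition; the function and parameter names are ours, and \(d\) and \(c\) are assumed to be supplied as Python callables.

    def total_cost(U, S, d, c, r):
        # C(S): for each item not in the summary, add the cost of its minimum
        # distance to the summary, with that distance capped at the radius r.
        total = 0.0
        for u in U:
            if u in S:
                continue
            d_min = min((d(u, s) for s in S), default=float("inf"))
            total += c(min(d_min, r))
        return total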

Given a target summary size, the objective is to choose \(S\) of that size minimizing the total cost. By a reduction from Geometric Set Cover, our problem is NP-hard, so we should not hope for an exact and efficient solution. Our goal is instead an efficient solution whose cost is not too far from the optimum.

Prior Work

The paper [1] presents a polynomial-time approximation algorithm to this problem. The algorithm is greedy, starting with the summary \(S=\emptyset\) and repeatedly adding to \(S\) the item in \(U \setminus S\) that minimizes \(C\), or specifically, \(\text{argmin}_{u \in U \setminus S} C(S \cup \{u\})\).

This algorithm works by computing \(C(S \cup \{u\})\) for each candidate \(u \in U \setminus S\), and each such evaluation itself iterates over \(U \setminus S\). Thus each addition to \(S\) requires \(\Theta((|U| - |S|)^2)\) computation, and if \(n = |U|\) and the final summary has size \(k \in o(n)\), the algorithm runs in time \(\Theta(k n^2)\).

It is worth noting that [1] does not consider a radius \(r\); in other words, it is as if \(r = \infty\) in our formulation. We add \(r\) as a mild modification in order to admit an efficient algorithm. In most applications, points that are far apart are not meaningfully similar, so adding \(r\) does not materially change the problem.
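
For reference, the following is a minimal sketch of this baseline greedy selection, with the radius cap \(r\) of our formulation included; maintaining the minimum distances between iterations gives the \(\Theta(k n^2)\) running time discussed above. The names are ours, not from [1].

    def greedy_baseline(U, d, c, r, k):
        # Greedy selection as described in [1] (plus the radius cap):
        # repeatedly add the item u minimizing C(S ∪ {u}).
        items = list(U)
        S = set()
        # dmin[v] = distance from v to the nearest item of S, capped at r.
        dmin = {v: r for v in items}
        for _ in range(min(k, len(items))):
            def cost_if_added(u):
                # C(S ∪ {u}), computed from the maintained dmin values.
                return sum(c(min(dmin[v], d(u, v)))
                           for v in items if v not in S and v != u)
            best = min((u for u in items if u not in S), key=cost_if_added)
            S.add(best)
            for v in items:
                dmin[v] = min(dmin[v], d(best, v))
        return S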

Our Approach

Overview

Given an item \(u \in U \setminus S\), let \[\Delta\text{Cost}_S(u) = C(S) - C(S \cup \{u\}).\] Like the algorithm in [1], at each step our algorithm adds to the summary the element whose addition most reduces the total cost: we add \(u^* := \text{argmax}_{u \in U \setminus S} \Delta\text{Cost}_S(u) = \text{argmin}_{u \in U \setminus S} C(S \cup \{u\})\) to \(S\). To find \(u^*\) efficiently, we maintain all \(u \in U \setminus S\) in a sorted set ordered by \(\Delta\text{Cost}_S(u)\). The sorted set allows us to select \(u^*\) immediately; however, upon selecting it, we need to compute \(\Delta\text{Cost}_{S \cup \{u^*\}}(v)\) for \(v \in U \setminus (S \cup \{u^*\})\). Our observation is that \(\Delta\text{Cost}_{S \cup \{u^*\}}(v) \ne \Delta\text{Cost}_S(v)\) only for specific \(v\), that there is a simple criterion identifying these \(v\), and that there is an efficient algorithm to obtain them.

Thus at a high level our algorithm is to start with \(S = \emptyset\) and repeat the following steps \(k\) times, where \(k\) is the desired size of the summary (or until some other stopping condition is reached):

  1. Remove \(u^* := \text{argmax}_{u \in U \setminus S} \Delta\text{Cost}_S(u)\) from the sorted set.
  2. Identify \(\{v \mid \Delta\text{Cost}_{S \cup \{u^*\}}(v) \ne \Delta\text{Cost}_S(v)\}\).
  3. For each \(v\) identified above, compute \(\Delta\text{Cost}_{S \cup \{u^*\}}(v)\) and re-sort \(v\) in the sorted set.
  4. Set \(S \leftarrow S \cup \{u^*\}\).

Steps 1 and 4 are trivial. The key observation for implementing steps 2 and 3 efficiently is that step 2 only requires examining items \(v\) that are close to \(u^*\), and step 3 only requires examining items that are close to these \(v\). Details follow.

Locality

The algorithm maintains \(\text{dMin}_S(v)\) for all \(v \in U \setminus S\). Let the "neighborhood" of an item \(v \in U\) be \(N(v) := \{w \in U \mid d(v, w) < r\}\). Note that \[\text{dMin}_S(v) = \begin{cases} \min_{s \in S \cap N(v)} d(s, v) & \text{ if } S \cap N(v) \ne \emptyset \\ r & \text{ otherwise} \end{cases}\] so \(\text{dMin}_S(v)\) depends only on the neighbors of \(v\) that are in \(S\). Thus \(\text{dMin}_{S \cup \{u^*\}}(v) \ne \text{dMin}_S(v)\) only if \(v \in N(u^*)\). Furthermore, the only items \(w\) for which \(\Delta\text{Cost}_{S \cup \{u^*\}}(w) \ne \Delta\text{Cost}_S(w)\) are those close to items whose \(\text{dMin}\) changed; this is made precise by, and follows from, the lemma below.

These locality observations allow us to precisely specify steps 2 and 3 of the high level algorithm above. Specifically:

Lemma

\[\Delta\text{Cost}_{S \cup \{u\}}(w) \ne \Delta\text{Cost}_S(w) \Leftrightarrow \exists v \in N(u) \text{ such that } \max \{d(u, v), d(v, w)\} < \text{dMin}_S(v).\]

Proof

\[\Delta\text{Cost}_S(w) = \sum_{v \in N(w)} \left[ c(\text{dMin}_S(v)) - c(\text{dMin}_{S \cup \{w\}}(v)) \right]\] since the terms for all other \(v \in U \setminus S\) are \(0\). For any \(v\) in the sum, the term for \(v\) in \(\Delta\text{Cost}_{S \cup \{u\}}(w)\) cannot be larger than in \(\Delta\text{Cost}_S(w)\), which can be seen by separately considering the cases \(d(u, v) \ge \text{dMin}_S(v)\), \(\text{dMin}_S(v) > d(u, v) > \text{dMin}_{S \cup \{w\}}(v)\), and \(\text{dMin}_{S \cup \{w\}}(v) \ge d(u, v)\). Therefore, because \(c\) is strictly increasing, \(\Delta\text{Cost}_S(w) > \Delta\text{Cost}_{S \cup \{u\}}(w)\) iff \(\exists v \in N(w)\) such that \[c(\text{dMin}_S(v)) - c(\text{dMin}_{S \cup \{w\}}(v)) \ne c(\text{dMin}_{S \cup \{u\}}(v)) - c(\text{dMin}_{S \cup \{u, w\}}(v)). \label{*}\tag{*}\]

If \(d(u, v) \ge \text{dMin}_S(v)\) then the right hand side of \ref{*} equals the left; if \(d(v, w) \ge \text{dMin}_S(v)\) then both sides of \ref{*} are zero. Either way, \(v\)'s term in \(\Delta\text{Cost}_{S \cup \{u\}}(w)\) equals its term in \(\Delta\text{Cost}_S(w)\).

Conversely, \[c(\text{dMin}_{S \cup \{u, w\}}(v)) = \min \{ c(\text{dMin}_{S \cup \{u\}}(v)), c(\text{dMin}_{S \cup \{w\}}(v)) \}. \label{**}\tag{**}\] If \(d(u, v) < \text{dMin}_S(v)\) and \(d(v, w) < \text{dMin}_S(v)\) then \(c(\text{dMin}_{S \cup \{u\}}(v)) < c(\text{dMin}_S(v))\) and \(c(\text{dMin}_{S \cup \{w\}}(v)) < c(\text{dMin}_S(v))\), so regardless of which term on the right hand side of \ref{**} equals the left hand side, strict inequality must hold in \ref{*}. Therefore \(\Delta\text{Cost}_{S \cup \{u\}}(w) \ne \Delta\text{Cost}_S(w)\).

Complete Algorithm

Store \(U\) in a spatial index that can be searched for neighborhoods \(N(u) = \{v \in U \mid d(u, v) < r\}\). Start with \(S = \emptyset\), set \(\text{dMin}(u) = r\), and compute \(\Delta\text{Cost}(u) = \sum_{v \in N(u)} \left[ c(r) - c(d(u, v)) \right]\) for each \(u \in U\). Store \(U\) in a priority queue or sorted set ordered by \(\Delta\text{Cost}\) descending. Then repeat the following steps as many times as desired (a code sketch follows the list):
  1. Remove the first element of the priority queue or sorted set and call it \(u^*\).
  2. Compute \(\text{dMinUpdate} := \{v \in N(u^*) \mid d(u^*, v) < \text{dMin}(v) \}\).
  3. Compute \(\Delta\text{CostUpdate} := \bigcup_{v \in \text{dMinUpdate}} \{ w \in N(v) \mid d(v, w) < \text{dMin}(v) \}\).
  4. For all \(v \in \text{dMinUpdate}\), set \(\text{dMin}(v) \leftarrow d(u^*, v)\).
  5. For all \(w \in \Delta\text{CostUpdate}\), set \(\Delta\text{Cost}(w) \leftarrow \sum_{x \in N(w),\, d(w, x) < \text{dMin}(x)} \left[ c(\text{dMin}(x)) - c(d(w, x)) \right]\) and re-sort \(w\) in the priority queue or sorted set (items no longer in the queue, including \(u^*\) itself, can simply be skipped).
  6. Set \(S \leftarrow S \cup \{u^*\}\).
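
Below is a minimal, self-contained Python sketch of the above; it is our illustration, not a canonical implementation. For clarity it precomputes neighborhoods by brute force, and it replaces the sorted set with a binary heap whose outdated entries are skipped when popped. It assumes \(c(0)\) is defined (e.g. \(c(t) = t^2\)), and items already added to the summary are simply skipped when updating.

    import heapq
    import itertools

    def summarize(U, d, c, r, k):
        # U: iterable of hashable items; d: metric; c: positive, strictly
        # increasing cost function; r: radius; k: desired summary size.
        items = list(U)
        # Neighborhoods N(u) = {v : d(u, v) < r}, stored with their distances.
        # Brute force here for clarity; use a spatial index for large U.
        nbrs = {u: [(v, d(u, v)) for v in items if d(u, v) < r] for u in items}

        dmin = {u: r for u in items}  # dMin(u), capped at r
        dcost = {u: sum(c(r) - c(duv) for _, duv in nbrs[u]) for u in items}  # ΔCost(u)

        # Max-priority queue on ΔCost via a min-heap of negated values;
        # outdated entries are detected and skipped when popped.
        tie = itertools.count()
        heap = [(-dcost[u], next(tie), u) for u in items]
        heapq.heapify(heap)

        summary, selected = [], set()
        while heap and len(summary) < k:
            neg, _, u_star = heapq.heappop(heap)
            if u_star in selected or -neg != dcost[u_star]:
                continue  # stale entry or already chosen
            # Step 2: items whose dMin changes.
            dmin_update = [(v, duv) for v, duv in nbrs[u_star] if duv < dmin[v]]
            # Step 3: items whose ΔCost changes (the criterion from the lemma),
            # excluding items no longer in the queue.
            dcost_update = {w for v, _ in dmin_update
                            for w, dvw in nbrs[v] if dvw < dmin[v]}
            dcost_update -= selected | {u_star}
            # Step 4: update dMin.
            for v, duv in dmin_update:
                dmin[v] = duv
            # Step 5: recompute ΔCost for the affected items and re-queue them.
            for w in dcost_update:
                dcost[w] = sum(c(dmin[x]) - c(dwx)
                               for x, dwx in nbrs[w] if dwx < dmin[x])
                heapq.heappush(heap, (-dcost[w], next(tie), w))
            # Step 6: add u* to the summary.
            selected.add(u_star)
            summary.append(u_star)
        return summary

For example, with points on a line, the absolute difference as the metric, and quadratic cost:

    points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
    print(summarize(points, d=lambda a, b: abs(a - b), c=lambda t: t * t, r=1.0, k=2))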

Complexity

Let \(n = |U|\), \(k = |S|\), and \(m \ge \max_{u \in U} |N(u)|\). Also let \(q\) upper bound the time to query the spatial index and to access the priority queue or sorted set. Then this algorithm runs in time \(T \in O(n(q+m) + km^2(q+m))\). The space complexity is dominated by the spatial index: if all neighborhoods are pre-computed, the required space is \(\Theta(mn)\); otherwise it is typically \(\Theta(n)\).

Given \(T\)'s cubic dependence on \(m\), the performance of this algorithm depends critically on neighborhoods being small. Thus \(r\) should be chosen so that \(m \in o(n)\), and ideally \(m \in O(\log n)\). Efficient spatial indexes and sorted data structures typically allow \(q \in O(\log n)\). If both \(m\) and \(q\) are \(O(\log n)\) then \(T \in O(n \log n + k (\log n)^3)\), which is \(O(n \log n)\) whenever \(k \in O\left(\frac{n}{(\log n)^2}\right)\), as it typically would be. Compare this to the original algorithm's \(\Theta(k n^2)\).
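
For concreteness, when the items are points under a Euclidean metric, the neighborhoods can be precomputed with a k-d tree; the sketch below uses scipy's cKDTree, and the function name and array conventions are ours.

    import numpy as np
    from scipy.spatial import cKDTree

    def euclidean_neighborhoods(points, r):
        # points: (n, dim) NumPy array. For each point i, return the
        # (index, distance) pairs of the points within distance r of it, i.e. N(i).
        tree = cKDTree(points)
        neighborhoods = []
        for i, idx in enumerate(tree.query_ball_point(points, r)):
            dists = np.linalg.norm(points[idx] - points[i], axis=1)
            neighborhoods.append(list(zip(idx, dists)))
        return neighborhoods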

Conclusion

We have demonstrated a simple and efficient algorithm for extractive summarization in a metric space. Under mild assumptions, it runs far faster than the original algorithm of [1].

Code

You may find a reference implementation of the above algorithm here.

References

  1. H. Lin, J. Bilmes and S. Xie, "Graph-based submodular selection for extractive summarization," 2009 IEEE Workshop on Automatic Speech Recognition & Understanding, Merano, 2009, pp. 381-386.